Abstract

With advances in artificial intelligence technologies and the
development of fifth generation networks, a network may face many
hazards and challenges as a growing number of users access it
simultaneously. Users worry about losing the confidentiality of
their data, so the security of the network has to be considered.
Threats on the network can be classified in many ways, and an
intrusion detection system (IDS) is the tool mainly used to detect
them. Attacks on a network can be divided into minor attacks and
major attacks. Denial-of-Service (DoS) and Probe attacks belong to
the major kind, while User-to-Root (U2R) and Remote-to-Login (R2L)
attacks fall into the minor category. Minor attacks are also called
rare attacks; they are very harmful to a host and very difficult to
recognize. This paper surveys IDSs and the machine learning
algorithms used to implement them.


1. Introduction


In today's modern life, networks play a vital role, which makes
internet security an interesting field of research. Many methods
exist for network security, such as firewalls, anti-malware
software and intrusion detection systems (IDSs), which help protect
networks against internal and external intrusions.


An IDS is used to secure a network by watching over the software
and hardware on that network. The first kind of intrusion detection
system (Pattawaro and Polprasert) came into existence in 1980, and
many IDS methods have been developed since then. The complication
with the available IDSs was that they were unable to find threats
that occurred unknowingly, such as switching in the network
environment. To solve these problems, human-independent IDSs were
developed, built on machine learning techniques. Machine learning
belongs to the family of artificial intelligence and allows
software applications to produce more accurate outcomes without
human intervention by training a machine how to learn (Anita, S,
and Gupta). From sample data, also called training data, these
algorithms build a model in order to make decisions without being
explicitly programmed to do so. The objective of this paper is to
explain the concept of IDSs and to discuss distinct machine
learning algorithms that can be used to implement them.


2. Concept of IDS 


An intrusion detection system is application software that monitors
a network or system for malicious activity. IDSs are crucial for a
network: they monitor hosts and networks and generate alarms based
on the behavior of computer systems and on any suspicious activity.

An intrusion detection system acts as the heart of a network
structure. Its job is to keep monitoring network activity by
regularly verifying the connection patterns and the flow of packets
through the network (Buczak and Guven). The advantage of an IDS is
its ability to classify the network and packets based on a set of
predefined parameters (Shah et al.).

Based on the recognition method used, IDSs can be broadly
classified into detection based methods and source based methods.
Detection based methods fall into two categories: signature based
detection (also called misuse detection) and anomaly detection.
Source based methods likewise have two classes: host based and
network based methods (Masduki et al.).

FIGURE 1. Model


2.1. Detection based methods 

As previously mentioned, these methods fall into two categories:
signature based detection and anomaly detection.

2.1.1. Signature based Detection 


This approach, generally known as the signature based method, is
rule based (Umbarkar and Shukla). It relies on a signature database
and works by comparing a sample signature with those in the
database. The difficulty of this method lies in composing
well-organized signatures (Lansky et al.). The method gained
popularity because it reports the attack type together with its
cause and has a low false alarm rate. Its disadvantages are a high
missed alarm rate, poor detection of unknown attacks, and the need
to maintain a large signature database.
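
The comparison of a sample against a signature database can be sketched as follows. This is only a minimal illustration, not a real rule set: the two signature names and their patterns are hypothetical, and a deployed system would hold far more entries.

```python
import re
from typing import Dict, List

# Hypothetical signature database (an assumption for illustration):
# each name maps to a pattern that identifies one known attack.
SIGNATURES: Dict[str, re.Pattern] = {
    "sql_injection": re.compile(r"union\s+select", re.IGNORECASE),
    "path_traversal": re.compile(r"\.\./\.\./"),
}


def match_signatures(payload: str) -> List[str]:
    """Return the names of every signature that occurs in the payload."""
    return [name for name, pattern in SIGNATURES.items() if pattern.search(payload)]


# A known attack is reported by name; unseen traffic matches nothing.
known = match_signatures("GET /item?id=1 UNION SELECT password FROM users")
unseen = match_signatures("GET /index.html")
```

A payload that matches no stored pattern returns an empty list, which reflects why this approach misses attacks that are not yet in the database.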


2.1.2. Anomaly Detection 


This kind of detection method is used to detect changes in
behavior. It works by creating a profile of normal behavior and
comparing activity against that profile; any deviation triggers an
alert. Creating a normal profile is therefore mandatory in this
method (Khosravi-Farmad, Ramaki, and Bafghi). The benefit of this
method is that it generalizes well and is very efficient at finding
unknown attacks. Its shortcomings are a higher false alarm rate and
an inability to give the reasons for an irregularity. The
distinction between signature based detection and anomaly detection
is listed in Table 1.
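
A minimal sketch of the profile-and-compare idea is given below. It assumes a single numeric feature (connections per minute) and a deviation threshold of three standard deviations; both choices are illustrative assumptions rather than anything prescribed by the surveyed work.

```python
from statistics import mean, stdev


def build_profile(normal_values):
    """Summarise normal behavior as (mean, standard deviation)."""
    return mean(normal_values), stdev(normal_values)


def is_anomalous(value, profile, threshold=3.0):
    """Flag a value that deviates from the profile by more than threshold sigmas."""
    mu, sigma = profile
    return abs(value - mu) > threshold * sigma


# Hypothetical connections-per-minute samples recorded during normal operation.
profile = build_profile([100, 102, 98, 101, 99, 100])
alert_small_change = is_anomalous(103, profile)
alert_large_change = is_anomalous(500, profile)
```

The sketch only says that a value is unusual, not why, which mirrors the weak interpretability noted for this method.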


2.2. Source of data based methods 


This category contains two methods: host based IDSs and network
based IDSs.


2.2.1. Host based IDSs 


These are software based entities installed on a host computer to
inspect and monitor all the traffic on the system's application
files and operating system. The benefit of this method is its
ability to detect hazards by checking the network traffic before
data is exchanged. The disadvantage is that only the host system is
monitored, i.e. the IDS has to be installed on every host (P.
Singh, S. P. Singh, and D. S. Singh).


2.2.2. Network based IDSs 


In contrast to host based IDSs, these IDSs are hardware based and
need to be installed to monitor network traffic. Since these IDSs
do not depend on the OS, they can be installed in any OS
environment (Hindy et al.). The benefit of this method is its
ability to identify specific protocol types and attacks on the
network. The shortcoming is that only the network segment through
which the traffic moves is monitored. Differences between
host-based and network-based IDSs are listed in Table 2.


TABLE 1. Differences between signature based detection and anomaly detection

| Criterion | Signature based | Anomaly detection |
|---|---|---|
| Detection performance | High missed alarm rate and low false alarm rate | High false alarm rate and low missed alarm rate |
| Detection efficiency | High detection efficiency, which decreases as the signature database grows | Efficiency depends on the complexity of the model |
| Domain knowledge dependency | All findings depend on domain knowledge | Low dependency, because only the feature design depends on domain knowledge |
| Interpretability | Strong, since the design rests on domain knowledge | Weak, because only the detection results are given |
| Unknown attack recognition | Only known attacks are detected | Both known and unknown attacks are detected |


3. Machine Learning Algorithms used in IDS


Machine learning belongs to the family of artificial intelligence
and lets software applications produce more accurate results
without being explicitly programmed to do so. Based on historical
data, they produce new output values (Zhengbing, Zhitang, and
Junqi). Since it is a data driven method, knowing the type of data
is the basic first step. This section reviews the machine learning
algorithms used to implement IDSs, such as SVM (support vector
machine), KNN (K-nearest neighbor), ANN (artificial neural
network), LR (logistic regression), Naive Bayes, decision tree,
clustering and hybrid methods.


3.1. Artificial Neural Network (ANN)


The design of an ANN is inspired by the way the human brain works.
An ANN is a collection of connected nodes called artificial
neurons. It consists of an input layer, hidden layers and an output
layer, in which the nodes of adjacent layers are fully connected.
All of these nodes have to be trained before the network is put to
work, and training is very time-consuming because of the complex
structure. ANN models are generally trained with the back
propagation algorithm, which is difficult to apply to the training
of deep networks (Gu et al.).


FIGURE 2. Block diagram: ANN for pattern classification (surviving
labels: expected results (target), neurons, weighted signal,
adjust).


FIGURE 3. Diagram of activities in an ANN model
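
The layered structure and the back propagation step of an ANN can be sketched in plain Python as below. This is a toy network under stated assumptions: one hidden layer of three sigmoid units, a squared-error loss, a learning rate of 0.5, and four invented two-feature samples in which label 1.0 stands for an attack. None of these values come from the surveyed work.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class TinyANN:
    """Input layer -> one fully connected hidden layer -> one sigmoid output."""

    def __init__(self, n_inputs, n_hidden, seed=0):
        rng = random.Random(seed)
        # The last weight in every row is the bias term.
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
                         for _ in range(n_hidden)]
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(self, x):
        hidden = [sigmoid(sum(w * v for w, v in zip(ws, x + [1.0])))
                  for ws in self.w_hidden]
        output = sigmoid(sum(w * v for w, v in zip(self.w_out, hidden + [1.0])))
        return hidden, output

    def train_step(self, x, target, lr=0.5):
        hidden, output = self.forward(x)
        # Error term of the output unit (squared error, sigmoid activation).
        delta_out = (output - target) * output * (1.0 - output)
        # Propagate the error back to every hidden unit before updating weights.
        delta_hidden = [delta_out * self.w_out[j] * h * (1.0 - h)
                        for j, h in enumerate(hidden)]
        for j, v in enumerate(hidden + [1.0]):
            self.w_out[j] -= lr * delta_out * v
        for j, ws in enumerate(self.w_hidden):
            for i, v in enumerate(x + [1.0]):
                ws[i] -= lr * delta_hidden[j] * v


def mean_squared_error(net, samples):
    return sum((net.forward(x)[1] - t) ** 2 for x, t in samples) / len(samples)


# Hypothetical scaled features per connection; 1.0 marks an attack.
samples = [([0.1, 0.0], 0.0), ([0.2, 0.1], 0.0),
           ([0.9, 0.8], 1.0), ([0.8, 0.9], 1.0)]
net = TinyANN(n_inputs=2, n_hidden=3)
loss_before = mean_squared_error(net, samples)
for _ in range(2000):
    for x, target in samples:
        net.train_step(x, target)
loss_after = mean_squared_error(net, samples)
```

Even for this tiny network the training loop visits every weight on every sample, which hints at why training cost grows quickly with network size.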
TABLE 2. Differences between host-based and network-based IDSs

| Criterion | Host-Based IDS | Network-Based IDS |
|---|---|---|
| Data source | Depends on operating system or application program records | Depends on network traffic |
| Categorization | Difficult to categorize, since every host has to be equipped with an IDS | Easy to categorize, since it depends on network models |
| Detection efficiency | Less efficient, because it has to check a number of records each time | Highly efficient, since attacks are found in real time |
| Traceability of intrusion | Intrusions are traced based on the system call paths | The time and position of an intrusion are detected based on IP addresses and time stamps |
| Disadvantage | The behavior of the network cannot be analyzed | Only the traffic passing through the particular network segment is monitored |


3.2. Support Vector Machine (SVM)


The strategy of an SVM is to find the maximum-margin separating
hyperplane in the n-dimensional feature space. Because the
separating hyperplane is determined by only a small number of
support vectors, these methods give good results even with small
training sets (Pressley). However, they are sensitive to noise near
the hyperplane, and they are best at solving linear problems.

FIGURE 4. Schematic diagram of SVM structure


3.3. K-Nearest Neighbor (KNN)


KNN is one of the basic classification methods; it is
non-parametric and very efficient. A sample is likely to belong to
a class if most of its neighbors belong to that class (Tirumala,
Sathu, and Sarrafzadeh). For a given test sample, the K training
points nearest to it are selected, so the performance of KNN
depends strongly on the parameter K, which is chosen by the user.
The value of K determines the complexity and fitting ability of the
model.








FIGURE 5. KNN classification algorithm (surviving flowchart steps:
read the value of K, the type of distance (D) and the test data;
assign the majority class label of the K neighbors to the test
sample)
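
The KNN procedure (read K and the distance type, find the K nearest training points, assign the majority label) can be sketched as follows. The sketch assumes Euclidean distance and uses invented two-feature training points; the labels "normal" and "attack" are illustrative only.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """train: list of (feature_tuple, label). Returns the majority label
    among the k training points closest to the query (Euclidean distance)."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical scaled traffic features with known labels.
train = [
    ((0.10, 0.20), "normal"), ((0.20, 0.10), "normal"), ((0.15, 0.15), "normal"),
    ((0.90, 0.80), "attack"), ((0.80, 0.90), "attack"), ((0.85, 0.85), "attack"),
]
label_low = knn_predict(train, (0.2, 0.2), k=3)
label_high = knn_predict(train, (0.8, 0.8), k=3)
```

There is no training step here: all the work happens at prediction time, when every stored point is compared with the query, which is consistent with the long testing time attributed to KNN.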


3.4. Logistic Regression (LR) 


These models are used to solve classification problems. They work
by computing the probability of each class from a parametric
logistic distribution (Dias et al.). They are efficient because
they furnish probabilities and can classify new data using both
continuous and discrete datasets. The logistic distribution is
calculated with the formula given below.

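$$P(Y = k \mid x) = \frac{e^{w_k \cdot x}}{1 + \sum_{j=1}^{K-1} e^{w_j \cdot x}}$$


where k = 1, 2, ..., K-1 and w_k denotes the weight vector of class k.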
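
As a rough illustration of how class probabilities are obtained, the sketch below evaluates the standard multinomial logistic form with K-1 weight vectors and class K as the reference class. The function name, the weight values and the feature vector are assumptions made for the example.

```python
import math


def class_probabilities(x, weights):
    """weights holds K-1 weight vectors; class K is the reference class.
    Returns the K probabilities P(Y = 1 | x), ..., P(Y = K | x)."""
    scores = [math.exp(sum(w_i * x_i for w_i, x_i in zip(w, x))) for w in weights]
    denominator = 1.0 + sum(scores)
    return [s / denominator for s in scores] + [1.0 / denominator]


# Three classes, hence two hypothetical weight vectors.
probs = class_probabilities([1.0, 2.0], [[0.5, -0.25], [0.0, 0.0]])
```

The returned values always sum to one, so the class with the largest entry can be taken as the prediction.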
FIGURE 6. Logistic Regression model


3.5. Naive Bayes 


This method works on the principles of conditional probability and
the attribute independence hypothesis (Shi et al.). The network
acts as a directed acyclic graph in which every node represents a
discrete random variable of interest. A conditional probability
table (CPT) is maintained in which each node carries the states of
its random variable; it specifies the conditional probability of a
domain variable given the other connected variables. The formula
for conditional probability is given below.

$$P(X = x \mid Y = y) = \prod_{i} P(X_i = x_i \mid Y = y)$$
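
Under the attribute independence hypothesis, the class-conditional probability factorises over the attributes, so a classifier only needs per-attribute counts. The sketch below assumes categorical attributes and add-one smoothing (a choice made here, not taken from the surveyed work); the protocol and service values are invented.

```python
from collections import Counter, defaultdict


def train_naive_bayes(samples):
    """samples: list of (attribute_tuple, label) with categorical attributes."""
    class_counts = Counter(label for _, label in samples)
    value_counts = defaultdict(Counter)   # (label, attribute index) -> value counts
    vocab = defaultdict(set)              # attribute index -> distinct values seen
    for attrs, label in samples:
        for i, value in enumerate(attrs):
            value_counts[(label, i)][value] += 1
            vocab[i].add(value)
    return class_counts, value_counts, vocab


def predict_naive_bayes(model, attrs):
    class_counts, value_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total                       # prior P(Y = label)
        for i, value in enumerate(attrs):
            # Add-one smoothed estimate of P(X_i = value | Y = label).
            score *= (value_counts[(label, i)][value] + 1) / (count + len(vocab[i]))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical (protocol, service) records with known labels.
samples = [
    (("tcp", "http"), "normal"), (("tcp", "http"), "normal"), (("udp", "dns"), "normal"),
    (("tcp", "telnet"), "attack"), (("tcp", "telnet"), "attack"), (("icmp", "echo"), "attack"),
]
model = train_naive_bayes(samples)
prediction = predict_naive_bayes(model, ("tcp", "telnet"))
```

Because each attribute is scored separately, new samples can be folded in by updating the counters, which matches the incremental learning listed among the method's advantages.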





FIGURE 7. Naive Bayes classification


3.6. Decision tree


These algorithms are used in classification problems for learning
and modelling a dataset. New data is classified based on what the
model has learned from the previous dataset (Wasi et al.). Decision
tree algorithms exclude inappropriate and unnecessary features
automatically. The learning steps of this model are feature
selection, tree generation and tree pruning. To train a decision
tree model, the most appropriate features are selected one at a
time and child nodes are generated from them.
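
Feature selection and tree generation can be sketched as below. The sketch assumes categorical features and uses information gain as the selection criterion (an ID3-style choice made for illustration); the pruning step is omitted, and the records are invented.

```python
import math
from collections import Counter


def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, feature):
    remainder = 0.0
    for value in set(row[feature] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[feature] == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder


def build_tree(rows, labels, features):
    # Stop when the node is pure or no features remain: return the majority label.
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]
    best = max(features, key=lambda f: information_gain(rows, labels, f))
    tree = {"feature": best, "children": {}}
    for value in set(row[best] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        tree["children"][value] = build_tree(
            [rows[i] for i in idx], [labels[i] for i in idx],
            [f for f in features if f != best])
    return tree


def classify(tree, row, default="normal"):
    """Walk the tree; unseen feature values fall back to a default label."""
    while isinstance(tree, dict):
        tree = tree["children"].get(row[tree["feature"]], default)
    return tree


# Hypothetical records: the service decides the label, the protocol does not.
rows = [
    {"protocol": "tcp", "service": "http"}, {"protocol": "tcp", "service": "http"},
    {"protocol": "udp", "service": "dns"}, {"protocol": "tcp", "service": "telnet"},
    {"protocol": "tcp", "service": "telnet"}, {"protocol": "udp", "service": "telnet"},
]
labels = ["normal", "normal", "normal", "attack", "attack", "attack"]
tree = build_tree(rows, labels, ["protocol", "service"])
```

In this example the tree splits on the service only and never uses the protocol, a small instance of uninformative features being left out automatically.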
FIGURE 8. Decision Tree Hierarchy











3.7. Clustering 


Clustering is the task of segregating data points into a number of
groups, such that highly similar data are grouped into the same
cluster and less similar data are grouped into different clusters.
The benefit of these algorithms is that they need no prior
knowledge. However, external information has to be consulted when
attacks are detected with a clustering algorithm. K-means is an
example of a clustering algorithm (Chen et al.).

K-means: This is a typical clustering algorithm that uses the
Euclidean distance between the data and the centre of each cluster.
K refers to the number of clusters and "means" stands for the mean
of the attributes. The idea behind the algorithm is to keep the
distance inside a cluster small and the distance between clusters
large. Advantages and disadvantages of various shallow models are
listed in Table 3.
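
The K-means idea of alternating between assigning points to the nearest centre and recomputing each centre as the mean of its members can be sketched as follows. Taking the first K points as initial centres and running a fixed number of iterations are simplifying assumptions for this example, and the points are invented.

```python
import math


def kmeans(points, k, iterations=20):
    # Assumption: the first k points serve as the initial cluster centres.
    centroids = [points[i] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:
                # New centre = attribute-wise mean of the cluster members.
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, clusters


# Hypothetical two-feature points forming two well separated groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = kmeans(points, k=2)
```

The result depends on the starting centres and on K, which is why the method is described as sensitive to initialization and to the parameter K. The clusters also carry no labels, so deciding which cluster represents attacks needs external information.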


TABLE 3. Advantages and disadvantages of various shallow models

| Algorithm | Advantages | Disadvantages |
|---|---|---|
| ANN | Good fitting ability; able to deal with nonlinear data | Training is time consuming; liable to overfitting |
| SVM | Learns useful information even from a small training set; strong generalization ability | Difficult to apply to big datasets; sensitive to kernel-related parameters |
| KNN | Easy to train; well suited to nonlinear data; can be applied to massive data; robust against noise | Long testing time; sensitive to the parameter K |
| Naive Bayes | Supports incremental learning; robust against noise | Performs poorly on attribute-related data |
| LR | Simple to implement; fast to train | Cannot handle nonlinear data; liable to overfitting |
| Decision tree | Selects features automatically; strong interpretability | Ignores the correlation of data |
| K-means | Simple to implement; fast to train | Sensitive to initialization; sensitive to the parameter K |


4. IDS Feature Selection


Machine learning classification involves two phases: a training
phase and a classification phase. The distribution of the features
is learned during the training phase, and the learned features are
then used as the normal profile during the classification phase,
where any abnormality is detected.
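
The two phases can be sketched as a pair of functions. This is a deliberately simple illustration: it assumes the normal profile is just the observed range of every feature, and the feature names and values are hypothetical.

```python
def train_profile(normal_records):
    """Training phase: learn the observed range of every feature."""
    profile = {}
    for record in normal_records:
        for feature, value in record.items():
            low, high = profile.get(feature, (value, value))
            profile[feature] = (min(low, value), max(high, value))
    return profile


def classify(record, profile):
    """Classification phase: list the features that leave the learned range."""
    return [f for f, v in record.items()
            if f in profile and not profile[f][0] <= v <= profile[f][1]]


# Hypothetical records collected during normal operation.
normal_records = [
    {"duration": 1.0, "bytes": 200},
    {"duration": 2.0, "bytes": 350},
    {"duration": 1.5, "bytes": 300},
]
profile = train_profile(normal_records)
abnormal_features = classify({"duration": 1.2, "bytes": 90000}, profile)
```

An empty result means the record is consistent with the normal profile; any listed feature marks an abnormality to investigate.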
FIGURE 9. Design of machine learning classification process


5. Conclusion

This survey focuses on the classification of machine learning
algorithms and summarizes the machine learning based IDSs that are
used to secure networks.